Add missing values based on the value of a column in R
This is my sample dataset:
vector1 <-
  data.frame(
    "name" = "a",
    "age" = 10,
    "fruit" = c("orange", "cherry", "apple"),
    "count" = c(1, 1, 1),
    "tag" = c(1, 1, 2)
  )
vector2 <-
  data.frame(
    "name" = "b",
    "age" = 33,
    "fruit" = c("apple", "mango"),
    "count" = c(1, 1),
    "tag" = c(2, 2)
  )
vector3 <-
  data.frame(
    "name" = "c",
    "age" = 58,
    "fruit" = c("cherry", "apple"),
    "count" = c(1, 1),
    "tag" = c(1, 1)
  )
list <- list(vector1, vector2, vector3)
print(list)
This is my attempt:
default <- c("cherry",
             "orange",
             "apple",
             "mango")
for (num in 1:length(list)) {
  #print(list[[num]])
  list[[num]] <- rbind(
    list[[num]],
    data.frame(
      "name" = list[[num]]$name,
      "age" = list[[num]]$age,
      "fruit" = setdiff(default, list[[num]]$fruit), #add missed value
      "count" = 0,
      "tag" = 1 #not found solutions
    )
  )
  print(paste0("--------------", num, "--------"))
  print(list)
}
#print(list)
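I am trying to find which fruits are missing from each data frame, where the set of fruits is checked per value of tag. For example, the first data frame has tags 1 and 2. If tag 1 is missing any of the default fruits (apple, mango, and so on), the missing default fruits should be added to the data frame with a count of 0. The expected format is as follows: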
[[1]]
name age fruit count tag
1 a 10 orange 1 1
2 a 10 cherry 1 1
3 a 10 apple 1 2
4 a 10 mango 0 1
5 a 10 apple 0 1
6 a 10 mango 0 2
7 a 10 orange 0 2
8 a 10 cherry 0 2
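Watching the loop run, I also found that the first iteration added mango three times, and I could not work out why the missing value is not added just once. The full output is as follows: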
[[1]]
name age fruit count tag
1 a 10 orange 1 1
2 a 10 cherry 1 1
3 a 10 apple 1 2
4 a 10 mango 0 1
5 a 10 mango 0 1
6 a 10 mango 0 1
[[2]]
name age fruit count tag
1 b 33 apple 1 2
2 b 33 mango 1 2
3 b 33 cherry 0 1
4 b 33 orange 0 1
[[3]]
name age fruit count tag
1 c 58 cherry 1 1
2 c 58 apple 1 1
3 c 58 orange 0 1
4 c 58 mango 0 1
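Can anyone help me with a simple approach or some other method? Should I use the sqldf function to add the 0 values? Would that be a simple way to solve my problem?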
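A solution using dplyr and tidyr. We can use complete to expand the data frame and specify 0 as the fill value for count.
Note that I changed the name of your list from list to fruit_list, because naming an object after a built-in R function is bad practice. Also note that I set stringsAsFactors = FALSE when creating the sample data frames because I did not want factor columns. Finally, I used lapply instead of a for-loop to iterate over the list elements.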
library(dplyr)
library(tidyr)
fruit_list2 <- lapply(fruit_list, function(x){
  x2 <- x %>%
    complete(name, age, fruit = default, tag = c(1, 2), fill = list(count = 0)) %>%
    select(name, age, fruit, count, tag) %>%
    arrange(tag, fruit) %>%
    as.data.frame()
  return(x2)
})
fruit_list2
# [[1]]
# name age fruit count tag
# 1 a 10 apple 0 1
# 2 a 10 cherry 1 1
# 3 a 10 mango 0 1
# 4 a 10 orange 1 1
# 5 a 10 apple 1 2
# 6 a 10 cherry 0 2
# 7 a 10 mango 0 2
# 8 a 10 orange 0 2
#
# [[2]]
# name age fruit count tag
# 1 b 33 apple 0 1
# 2 b 33 cherry 0 1
# 3 b 33 mango 0 1
# 4 b 33 orange 0 1
# 5 b 33 apple 1 2
# 6 b 33 cherry 0 2
# 7 b 33 mango 1 2
# 8 b 33 orange 0 2
#
# [[3]]
# name age fruit count tag
# 1 c 58 apple 1 1
# 2 c 58 cherry 1 1
# 3 c 58 mango 0 1
# 4 c 58 orange 0 1
# 5 c 58 apple 0 2
# 6 c 58 cherry 0 2
# 7 c 58 mango 0 2
# 8 c 58 orange 0 2
Data
vector1 <-
  data.frame(
    "name" = "a",
    "age" = 10,
    "fruit" = c("orange", "cherry", "apple"),
    "count" = c(1, 1, 1),
    "tag" = c(1, 1, 2),
    stringsAsFactors = FALSE
  )
vector2 <-
  data.frame(
    "name" = "b",
    "age" = 33,
    "fruit" = c("apple", "mango"),
    "count" = c(1, 1),
    "tag" = c(2, 2),
    stringsAsFactors = FALSE
  )
vector3 <-
  data.frame(
    "name" = "c",
    "age" = 58,
    "fruit" = c("cherry", "apple"),
    "count" = c(1, 1),
    "tag" = c(1, 1),
    stringsAsFactors = FALSE
  )
fruit_list <- list(vector1, vector2, vector3)
default <- c("cherry", "orange", "apple", "mango")
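As a side note on why the loop in the question added mango three times: the base R sketch below, using only the first sample data frame, suggests the cause is recycling. list[[num]]$name is a vector with one entry per existing row, so data.frame() repeats the single missing fruit to match that length. The names missing_fruit, recycled and fixed are made up for this illustration and do not appear in the question.

```r
# First sample data frame from the question
vector1 <- data.frame(
  name = "a", age = 10,
  fruit = c("orange", "cherry", "apple"),
  count = c(1, 1, 1), tag = c(1, 1, 2),
  stringsAsFactors = FALSE
)
default <- c("cherry", "orange", "apple", "mango")

# Only "mango" is absent from vector1
missing_fruit <- setdiff(default, vector1$fruit)

# vector1$name and vector1$age have length 3, so the single missing
# fruit is recycled into 3 identical rows
recycled <- data.frame(name = vector1$name, age = vector1$age,
                       fruit = missing_fruit, count = 0, tag = 1)
nrow(recycled)  # 3

# Using only the first name and age gives one row per missing fruit
fixed <- data.frame(name = vector1$name[1], age = vector1$age[1],
                    fruit = missing_fruit, count = 0, tag = 1)
nrow(fixed)  # 1
```

This only removes the duplicated rows; it still hard-codes tag = 1, so it does not by itself produce every fruit and tag combination.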
Consider a base R approach (lapply, expand.grid, transform, rbind, aggregate) that appends every possible fruit and tag combination to each data frame and keeps the maximum count.
new_list <- lapply(list, function(df) {
  fruit_tag_df <- transform(expand.grid(fruit=c("apple", "cherry", "mango", "orange"),
                                        tag=c(1,2)),
                            name = df$name[1],
                            age = df$age[1],
                            count = 0)
  aggregate(.~name + age + fruit + tag, rbind(df, fruit_tag_df), FUN=max)
})
Output
new_list
# [[1]]
# name age fruit tag count
# 1 a 10 apple 1 0
# 2 a 10 cherry 1 1
# 3 a 10 orange 1 1
# 4 a 10 mango 1 0
# 5 a 10 apple 2 1
# 6 a 10 cherry 2 0
# 7 a 10 orange 2 0
# 8 a 10 mango 2 0
# [[2]]
# name age fruit tag count
# 1 b 33 apple 1 0
# 2 b 33 mango 1 0
# 3 b 33 cherry 1 0
# 4 b 33 orange 1 0
# 5 b 33 apple 2 1
# 6 b 33 mango 2 1
# 7 b 33 cherry 2 0
# 8 b 33 orange 2 0
# [[3]]
# name age fruit tag count
# 1 c 58 apple 1 1
# 2 c 58 cherry 1 1
# 3 c 58 mango 1 0
# 4 c 58 orange 1 0
# 5 c 58 apple 2 0
# 6 c 58 cherry 2 0
# 7 c 58 mango 2 0
# 8 c 58 orange 2 0
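For comparison, the same completion can be sketched in base R with a left join: merge() with all.x = TRUE replaces the rbind()/aggregate() combination used above, so this is a different technique rather than a restatement of that code. The sketch runs on the second sample data frame only, and the names grid and out are illustrative.

```r
# Second sample data frame from the question
df <- data.frame(name = "b", age = 33,
                 fruit = c("apple", "mango"),
                 count = c(1, 1), tag = c(2, 2),
                 stringsAsFactors = FALSE)
default <- c("cherry", "orange", "apple", "mango")

# All fruit/tag combinations, carrying the data frame's name and age
grid <- expand.grid(fruit = default, tag = c(1, 2), stringsAsFactors = FALSE)
grid$name <- df$name[1]
grid$age  <- df$age[1]

# Left join: rows of grid with no match in df get NA for count
out <- merge(grid, df, by = c("name", "age", "fruit", "tag"), all.x = TRUE)

# Replace the NA counts with 0 and restore the original column order
out$count[is.na(out$count)] <- 0
out <- out[order(out$tag, out$fruit), c("name", "age", "fruit", "count", "tag")]
out
```

Wrapping the body in a function and passing it to lapply() would apply it to every element of the list.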
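The OP has asked to complete each data.frame in list so that all combinations of the default fruits and tags 1:2 appear in the result, where count should be set to 0 for the additional rows. In the end, each data.frame should contain at least 4 x 2 = 8 rows.
I want to propose two different approaches:
- Using lapply() and the CJ() (cross join) function from data.table to return a list.
- Using rbindlist() to combine the separate data.frames in list into one large data.table and applying the required transformations to the whole data.table.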
Using lapply() and CJ()
library(data.table)
# lst is the list of data.frames from the question (named list there)
lapply(lst, function(x) setDT(x)[
  CJ(name = name, age = age, fruit = default, tag = 1:2, unique = TRUE),
  on = .(name, age, fruit, tag)][
    is.na(count), count := 0][order(-count, tag)]
)
[[1]]
name age fruit count tag
1: a 10 cherry 1 1
2: a 10 orange 1 1
3: a 10 apple 1 2
4: a 10 apple 0 1
5: a 10 mango 0 1
6: a 10 cherry 0 2
7: a 10 mango 0 2
8: a 10 orange 0 2
[[2]]
name age fruit count tag
1: b 33 apple 1 2
2: b 33 mango 1 2
3: b 33 apple 0 1
4: b 33 cherry 0 1
5: b 33 mango 0 1
6: b 33 orange 0 1
7: b 33 cherry 0 2
8: b 33 orange 0 2
[[3]]
name age fruit count tag
1: c 58 apple 1 1
2: c 58 cherry 1 1
3: c 58 mango 0 1
4: c 58 orange 0 1
5: c 58 apple 0 2
6: c 58 cherry 0 2
7: c 58 mango 0 2
8: c 58 orange 0 2
Ordering by count and tag is not required, but it helps when comparing the result with the OP's expected output.
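Working on one large data.table
Instead of a list of data.frames with identical structure, we can use one large data.table in which the origin of each row is identified by an id column.
Indeed, the OP has asked other questions (including "using lapply function and list in r") where he asked for help in handling a list of data.frames. It has already been suggested to rbind the rows together.
The rbindlist() function has an idcol parameter which identifies the origin of each row: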
library(data.table)
rbindlist(list, idcol = "df")
df name age fruit count tag
1: 1 a 10 orange 1 1
2: 1 a 10 cherry 1 1
3: 1 a 10 apple 1 2
4: 2 b 33 apple 1 2
5: 2 b 33 mango 1 2
6: 3 c 58 cherry 1 1
7: 3 c 58 apple 1 1
Note that df contains the number of the source data.frame in list (or the names of the list elements if list is named).
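The named case can be sketched as follows, assuming the data.table package is installed. The list named_list and its element names first and second are hypothetical and are not part of the OP's data.

```r
library(data.table)

# Hypothetical named list of two small data.frames
named_list <- list(
  first  = data.frame(fruit = c("apple", "mango"), count = c(1, 1)),
  second = data.frame(fruit = "cherry", count = 1)
)

# With a named list, the id column holds the element names
# instead of the positions 1, 2, ...
combined <- rbindlist(named_list, idcol = "df")
combined$df  # "first" "first" "second"
```

Meaningful element names can make the combined table easier to read than bare positions.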
Now, we can apply the CJ() solution from above by grouping by df:
rbindlist(list, idcol = "df")[, .SD[
  CJ(name = name, age = age, fruit = default, tag = 1:2, unique = TRUE),
  on = .(name, age, fruit, tag)], by = df][
    is.na(count), count := 0][order(df, -count, tag)]
df name age fruit count tag
1: 1 a 10 cherry 1 1
2: 1 a 10 orange 1 1
3: 1 a 10 apple 1 2
4: 1 a 10 apple 0 1
5: 1 a 10 mango 0 1
6: 1 a 10 cherry 0 2
7: 1 a 10 mango 0 2
8: 1 a 10 orange 0 2
9: 2 b 33 apple 1 2
10: 2 b 33 mango 1 2
11: 2 b 33 apple 0 1
12: 2 b 33 cherry 0 1
13: 2 b 33 mango 0 1
14: 2 b 33 orange 0 1
15: 2 b 33 cherry 0 2
16: 2 b 33 orange 0 2
17: 3 c 58 apple 1 1
18: 3 c 58 cherry 1 1
19: 3 c 58 mango 0 1
20: 3 c 58 orange 0 1
21: 3 c 58 apple 0 2
22: 3 c 58 cherry 0 2
23: 3 c 58 mango 0 2
24: 3 c 58 orange 0 2
df name age fruit count tag